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Abstract 

Background: Most Bayesian models for the analysis of complex traits are not analytically tractable and inferences 
are based on computationally intensive techniques. This is true of Bayesian models for genome-enabled selection, 
which uses whole-genome molecular data to predict the genetic merit of candidate animals for breeding purposes. 
In this regard, parallel computing can overcome the bottlenecks that can arise from series computing. Hence, a 
major goal of the present study is to bridge the gap to high-performance Bayesian computation in the context of 
animal breeding and genetics. 

Results: Parallel Monte Carlo Markov chain algorithms and strategies are described in the context of animal 
breeding and genetics. Parallel Monte Carlo algorithms are introduced as a starting point including their 
applications to computing single-parameter and certain multiple-parameter models. Then, two basic approaches for 
parallel Markov chain Monte Carlo are described: one aims at parallelization within a single chain; the other is 
based on running multiple chains, yet some variants are discussed as well. Features and strategies of the 
parallel Markov chain Monte Carlo are illustrated using real data, including a large beef cattle dataset with 50K 
SNP genotypes. 

Conclusions: Parallel Markov chain Monte Carlo algorithms are useful for computing complex Bayesian models, 
which does not only lead to a dramatic speedup in computing but can also be used to optimize model 
parameters in complex Bayesian models. Hence, we anticipate that use of parallel Markov chain Monte Carlo 
will have a profound impact on revolutionizing the computational tools for genomic selection programs. 



Background high-dimensional model based on Markov chain Monte 

In recent decades, Bayesian inference has been increas- Carlo (MCMC) techniques is notoriously intensive in 

ingly used for analysis of complex statistical models, in computing and often requires days, weeks, or even 

part because of increased availability and performance of months of CPU (Central Processing Unit) time on per- 

personal computers and workstations. However, such sonal computers and workstations [2]. Therefore, in 

models are generally not analytically tractable and, order to overcome such computational burden, parallel 

hence, computationally demanding numerical techniques computing becomes appealing [3,4]. 

are inevitably required. This is especially true of Bayesian Parallel computing operates on the principle that a 

computation for genome-enabled prediction and selec- large problem can be split into smaller components and 

tion, which aims at using whole-genome molecular data solved concurrently (i.e., "in parallel"), each on a separ- 

to predict the genetic merit of candidate animals for ate processor (or CPU core) [5]. An instance of a com- 

breeding purposes [1]. Typically, implementation of a puter program and its activities that are taking place on 

each processor is referred to a process. Thus, parallel 
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bining results by the main "controlling" process. Parallel 
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computing can be achieved by programming with C, C++, 
or Fortran, e.g., using the MPI (Message Passing Interface) 
Ubrary to handle inter-process communication [6] . High- 
performance computing communities have developed 
parallel programs for decades but were previously limited 
to programs running on expensive super-computers. In 
the past twenty years, interest in parallel computing has 
grown markedly due to physical constraints that prevent 
frequency scaling [5] and to the need to handle datasets 
of unprecedented dimensionalities that are being gener- 
ated [7] . Parallel computing has now become a dominant 
paradigm in current computer architectures, mainly in 
the form of multi-core processors [8]. 

Parallel MCMC methods have recently been adopted 
in statistics and informatics [4,9] and in image proces- 
sing [10] but they have not received much attention in 
animal breeding and genetics. There are several reasons 
for this gap. First, MCMC algorithms are seemingly ser- 
ial, and parallelism is not as straightforward as one 
would expect. Second, many intensive computational 
tasks in breeding and genetics applications have been 
handled via some simple data parallelism, implemented 
through the "multiple-tasking" mechanism provided by 
multi-core Linux workstations. Multiple-tasking allows 
each processor to switch between tasks being executed 
on it, without having to wait for each task to finish, but 
this type of "parallel" computing is not scalable with the 
number of jobs. Recently, parallel MCMC algorithms 
and strategies have become a focal point for scientific 
computing in the post-genome era [4]. This is largely 
due to the need to handle genomic datasets of unprece- 
dented sizes, such as genome-wide dense markers or 
sequences for genome-enabled selection programs [2]. 
With a set of whole-genome markers (say 50K SNP 
markers or higher density) in a model, the computing 
task is highly challenging, particularly with sophisticated 
Bayesian models via MCMC implementation [1,11-13]. 

In this paper, we present a technical description of par- 
allel MCMC methods in the context of animal breeding 
and genetics. These algorithms typically fall into two cat- 
egories: running multiple MCMC chains or parallelization 
within a single chain; some variants of these algorithms 
are discussed as well. A major purpose of this paper is to 
advocate the use of parallel MCMC methods, hence in- 
fusing high-performance computing technologies into 
animal selection programs in the post-genome era. 

Methods 

Parallel Monte Carlo methods 

We start with parallel Monte Carlo methods, as a prelude 
to parallel MCMC. In practice, many statistical problems 
involve integrating over hundreds or even thousands of 
dimensions but usually these problems are not analytic- 
ally tractable. Instead, Monte Carlo simulation methods 



[14] can be used to tackle high-dimensional integrals. 
Standard Monte Carlo integration algorithms distribute 
the evaluation points uniformly over the integration 
regions. 

Parallel computing for evaluating integrals 

To begin, consider the following integral 

E{p{d)) = J p{6)md9, (1) 

for some high-dimensional 6 with density f{6). Suppose 
the integral cannot be evaluated analytically. If n realiza- 
tions of 6 can be sampled independently from f{d) 
then, according to the strong law of large numbers, the 

sample average ^ Z p(^d^'^^ provides an approximation 

to E{p{6)) when « ^ oo. 

Simple Monte Carlo algorithms proceed by averaging 
large numbers of values that are generated independ- 
ently of each other. Obviously, Monte Carlo simulation 
is parallel in computing because it can be conducted 
concurrently. By parallel computing, the entire set of 
samples can be divided among the available CPU cores 
and then each core generates a portion and summarizes 
its local samples. After all processors have finished their 
tasks, a master program summarizes all the partial data 
and outputs the final result. 

Suppose that there are K CPU cores that generate a 
total of T samples, each handling an equal portion of 
these samples. For simplicity, we assume that T is 
divisible by K, such that the quotient {m = T/K) is an in- 
teger. Then, parallel Monte Carlo simulation proceeds as 
follows [3]: 

-* Process 0 (master process): 

(a) computes and passes m to each process. 

-* Each (slave) process (say /): 

(a) simulates m independent realizations of d; 

(b) computes Sj = Z p(^6^'^^ , and passes Sj back to 
the master program. 

-* Process 0 (master program): 

K 

(a) sums Sj and generates the final sum S = Z Sf, 

(b) computes the Monte Carlo estimate as '' 

E{p{e)) = ^. 

Note that, in this example, the master process does not 
involve computing the sum of a portion of the data but it 
actually can. Also, note that each process is given the same 
number of samples. This works well if all CPU cores 
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process the data at the same speed or approximately so. In 
practice, however, cloclc frequencies (i.e., computing speed) 
can vary markedly among processors. Hence, it can be 
more effective for each processor (or CPU core) to process 
a different number of samples, roughly proportional to its 
computing speed, and then let the master program 
compute the weighted average of all samples obtained from 
the K cores. 



Parallel computing of single-parameter models 

A single-parameter model can serve as a building block 
for Bayesian modeling [15]. Consider a normal distribu- 
tion with known mean /4 and unknown variance tr^ to 
be inferred. The data density for a vector y of n identi- 
cally independently distributed {iid) observations is: 



posterior density /(cr-^ I y), then, as m — *■ cxd, £(ff^|y) 
can be approximated by the sample average: 



Z 

1=1 



E( 



(6) 



P[Y 



a oc 



exp 



2 a 



(2) 



where = ^ Z (y,- — ^) is the sufficient statistic. As- 
suming an inverse- prior distribution with scale 
and Vo degrees of freedom. 



Parallel computing of multiple-parameter models 

Many models involve more than one unknown. Although 
many parameters are involved, conclusions are often 
drawn about one or only a few parameters at a time. In 
Bayesian analysis, the aim is to obtain the marginal pos- 
terior distribution of each parameter of interest. Often, 
we can construct the joint posterior distribution of all 
unknowns and then integrate this distribution over the 
unknowns that are not of immediate interest, leading to 
the desired marginal distribution of the parameter of 
interest. 

Now, consider the normal distribution (2) but with 
both mean and variance unknown. Assuming prior inde- 
pendence of location and scale parameters, a vague prior 
density for ^ and cr^ is uniform on ( ^, log a) that is. 



(7) 



exp 



2 0-2 



(3) 



Then, it can be shown that the marginal posterior dis- 
tribution of is a scaled inverse- density with n — 



it can be shown that the posterior density of \s di 1 degrees of freedom and scaling parameter 
scaled inverse- redistribution with scale "° ""^ " ^ and 

^ i;o+ " 

vo-\-n degrees of freedom [15]: 



(8) 



Vo 



Vo ■ 



n 



(4) 



Hence, the posterior mean of is for vo -\- 

n > 2. Numerically, the posterior distribution of can 
be inferred based on posterior samples generated from 
(4). Computing for this single-parameter normal model 
can follow exactly the same algorithm as parallel Monte 
Carlo simulation. Briefly, K parallel processes are exe- 
cuted, each generating a portion of the posterior samples 
of cr^. Then, the master process generates the final sum 
and computes the estimated posterior mean of as a 
weighted average of all sample averages. 

To show why the algorithm of parallel Markov chain 
simulation applies to parallel computing of a single- 
parameter model, consider equation (1). For this 
single-parameter normal model, for example, the mar- 
ginal posterior expectation of can be expressed as: 



2 

where = Z (y,- —y) and jy = 77 ^ Yi- The mar- 
ginal posterior distribution of can be obtained by inte- 
grating the joint posterior density over leading to a 
student-t density: 



P{f^\ y) =ip{l^,o^\y)da^ 



tn-i\ y 



(9) 



E{a'\y) = ja'f{a'\y)da' 



(5) 



Clearly, (5) implies a similar Monte Carlo implementa- 
tion: if n samples of are generated from its marginal 



Therefore, posterior samples for and can be gen- 
erated independently from the following marginal pos- 
terior distributions, for i = 1, . . . , T iterations: 

-* Sampling cr^f^' from (8), 
-* Sampling from (9). 

Parallel Markov Chain Monte Carlo 

Analytical solutions are not always available for most 
multiple-parameter models. Instead, MCMC simulation 
can be used to draw samples from the joint posterior dis- 
tribution and then evaluate sampled values for the par- 
ameter(s) of interest while ignoring the values of other 
unknowns. MCMC methods are a variant of Monte 
Carlo schemes in which a Markov chain {Xy, /= 1,2} 
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is constructed with equilibrium distribution n equal to 
some distribution of interest, such as a posterior distribu- 
tion in a Bayesian analysis [16]. Typically, the initial value 
is not a draw from the distribution n but if the chain is 
constructed properly, then Xt n (here, d means con- 
vergence in distribution) and, under certain conditions, 
an estimator h converges to /i^i as i — > oo . However, a 
Markov chain is sequential by nature because the distri- 
bution of Xt+\ depends on the value of Xt , where t in- 
dexes the order of MCMC iterations. This introduces a 
difficulty to parallelization of a Markov chain. 

Parallel MCMC by running multiple chains 

A naive yet natural approach to parallel MCMC is simply 
to generate several independent Markov chains on differ- 
ent processors and then combine results appropriately 
[17,18]. Given that running multiple chains is simple and 
that they scale well with the number of available proces- 
sors (or CPU cores), this type of "multiple-chain" parallel- 
ism is usually the strategy to strive for in the first instance. 

Assume that we want to estimate some target distribu- 
tion p{X) but samples of X cannot be drawn directly 
from p{X). Instead, a Markov chain Xq, Xy, . . . can be 
generated, which, through some transition density 
u{ Xt+i\ Xt), converges to p{X) at equilibrium. Let 
there be i = 1,2, . . . , K parallel chains, each initialized 
and burned-in independently for updating steps be- 
fore more samples are drawn at intervals. As K —^ oo 
and all B i —^ cxd it can be shown that the ensemble is 
ergodic (tending in the limit) to p{X)[19]. 

An appealing advantage of running multiple chains is 
that these processes can be conducted concurrently with 
minimal coordination among tasks, as in the case of par- 
allel Monte Carlo simulation. However, unlike parallel 
Monte Carlo simulation, a major concern with running 
multiple MCMC is that the overall reduction in runtime 
from parallelism can be limited by the portion of each 
chain to be discarded in the beginning of MCMC sam- 
pling for convergence purposes (i.e., burn-in). If every 
chain has to spend a significant proportion of its time in 
burn-in, this would place serious limitations on the per- 
formance of the algorithm, because it would not scale 
well with an increasing number of processors [4]. 
According to Amdahl's (1967) law [20], if some portion 
p of steps, for 0 < p<l^ must be removed as burn- in 
from each chain, then the maximum speedup in com- 
puting through parallelization is (assuming that each 
step takes an equal amount of time): 

Sip)= ^"+" r^ool + i- (10) 
pn+ nK-^ = p ^ ' 

where n is the number of iterations after burn-in. Thus, 
parallel MCMC computing by virtue of running multiple 



chains is rewarding only when p is small. However, if p 
is large, the gain in computation through running mul- 
tiple chains instead of a single long chain can be very 
disappointing. 

Although running multiple Markov chains is theoretic- 
ally straightforward, chains are not necessarily ergodic. 
Hence, some variant multiple MCMC methods have 
been proposed. For example, samples from multiple 
Markov chains may be confined to isolated modes if the 
target distribution is multi-modal, or the chains may 
mix poorly when there are strong correlations between 
variables. Unfortunately, the latter is a common problem 
of Gibbs samplers [21]. Hence, pooling samples from 
multiple short chains may not necessarily give a better 
representation of p{X) than using a single long chain. If 
several chains are drifting to disparate modes, they will 
tend to be strongly influenced by the chains that they 
confine, because the weights will not necessarily be pro- 
portional to their relative masses. 

Several strategies have been proposed for handling the 
aforementioned issues for single chains, such as adaptive 
MCMC algorithms [16,22] and tempering [23,24]. 
Metropolis-coupled MCMC is an algorithm that is 
related to simulated tempering and tempered transitions 
[23,24] . It proceeds by simultaneously running a number 
of different Markov chains that are governed by different 
(but related) Markov chain transition probabilities. Oc- 
casionally, the algorithm "swaps" values between two dif- 
ferent chains, with probability governed by the 
Metropolis algorithm to preserve the stationarity of the 
target distribution. These swaps can speed up conver- 
gence of the algorithm substantially [4]. Craiu et al. [25] 
targeted the posterior with an ensemble of chains, using 
the covariance of samples across all chains to adapt the 
proposal covariance for a set of Metropolis-Hastings 
chains. While these multiple-chain methods use syn- 
chronous exchange of samples to expedite convergence, 
Murray [26] proposed mixing in an additional independ- 
ent proposal, representing some hitherto best estimate 
or summary of the posterior, and cooperatively adapting 
across chains. The idea is to construct a global best 
estimate of the posterior at any given step and then 
mix this estimate as a remote component with what- 
ever local proposal that a chain has adopted. This does 
not preclude adaptive treatment or tempering of that 
local proposal. It also permits a heterogeneous blend 
of remote proposals, so that the ensemble of chains 
can mix well. 

Parallelization within a single chain 

By running multiple Markov chains, we often observe 
that samplers mix poorly and each chain may require a 
very long burn-in time. Hence, it would be preferable to 
develop parallelism within a single chain, instead of 
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running multiple chains. As mentioned in the previous 
section, Markov chain simulation is an iterative proced- 
ure, in the sense that simulation of the next value of the 
chain depends on the current value. This creates diffi- 
culty for delivering parallelism for a single Markov 
chain. Nevertheless, we will show that a single chain can 
be parallelized, subject to assumptions of conditional 
independence in the model. The key is to identify such 
steps that can be implemented in parallel. 

Consider a Bayesian model with p scale parameters 
a= (ffi, o'2, cTp), where p can be equal to 1 in some 
cases, and q location parameters 0 = (0i, 02, Oq). In 
MCMC sampling, each element is updated once per iter- 
ation using a kernel density that preserves the desired 
target posterior distribution p{(r, 0\ y). Assume that up- 
dating a is very fast (given some sufficient statistics 
regarding the current state of 6) but updating 0 is highly 
time-consuming. This is typical of a multivariate normal 
distribution with a common scale parameter (or different 
groups of scale parameters) and a large number (say a 
few hundreds or thousands) of location parameters. In 
these cases, it would be preferable to parallelize the up- 
date steps for 6 in order to gain speed up in computing. 
In theory, parallelization of the update of 6 depends cru- 
cially on the conditional independence structure of the 
model. First, assume the simplest possible case, where 
d il. 6 j\ a, y, i j, meaning that the update of any 
particular 6 , will not depend on the state of any other 
8 j (77^ i). Thus, all 0s can be updated in parallel by 
delivering subsets, say q^, of the elements in 0 to the K 
CPU cores. For illustrative purpose, let there be only one 
a but many 0s. Then, after all parameters are given ini- 
tial values, the parallel MCMC algorithm proceeds by 
repeatedly conducting the following steps: 

-> Master program: 

(a) samples a new a, given realization of 0 and the data 
y, and 

(b) distributes the new a to each process. 

Each process (/c): 

(a) updates a subset of 0s that have been assigned to it, 
conditional on a and y, 

(b) computes summary statistics for the updated 6s, 
and 

(c) passes the summary statistics back to the master 
program. 

Often, the above algorithm works quite well when the 
8 are all independent of one another, given a and y. In 



practice, however, such independence may not necessar- 
ily hold and strategies must be developed to deliver effi- 
cient parallel MCMC algorithms given specific 
dependence between elements [3]. 

Applications 

Parallel simulation for a single-parameter normal model 

Consider a normal distribution model with unknown ^ 
and known a^. For a vector y of « iid observations, the 
likelihood is: 

(11) 

If a normal prior is assumed, that is, 

p{n)(x exp ^- ^ ( ^ - /^of^ (12) 

where f^Q and Tq are hyperparameters, it can be shown 
that the posterior density of fi is also normal [15]: 

.Wr)^«(-^^,(^ + ^) ). (13) 

Intuitively, the posterior mean of 8 is expressed as a 
weighted average of the prior mean ( ^q) and of the sam- 
ple mean (y), with weights equal to the corresponding 
precisions, and , respectively. Because this is a 

single-parameter model, posterior samples of fi can be 
simulated in parallel by following the same algorithm as 
for Monte Carlo simulation. 

The example data are average body weight daily gains 
(ADG) measured on 7670 Angus cattle. The kernel 
density of ADG is shown in Figure 1, which approxi- 
mately suggests a normal distribution. Assume that we 
know, from previous studies, that the population vari- 
ance of ADG is 0.58. In this example, the prior distribu- 
tion is assumed to be normal with mean equal to 4.0 
and variance equal to 1.0 (these are just guesses of the 
parameter values in the distribution of ADG). A parallel 
C program was used in this analysis (Appendix). To 
compile the parallel program, say using MPICH2, type: 
mpicc singNormMod_Parallel.c -o singNormP -Im 
[enter]. To conduct computing in parallel, type: mpirun 
-np XX ./singNormP [enter], where xx is the number of 
processors involved (or CPU cores). To estimate ^, we 
simulated a total of 1 000 000 values for which were 
handled by K = 10 processes, each generating 100 000 
values and computing the partial sum. Then, the K par- 
tial sums were transferred back to process 0, where the 
Monte Carlo estimate was computed. The illustrative 
program only outputs the posterior mean and the 
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Standard deviation. The original program used in the 
demonstration also outputs minimum and maximum 
values, and quartiles. This part of the code is omitted in 
the Appendix for simplicity of demonstration. The com- 
puting was conducted on a DELL Precision workstation 
equipped with Intel® Xeon® CPU (3.20GHz), 16G mem- 
ory, and cache size 6144KB. 

The posterior mean was estimated to be 3.394, which 
corresponded very well to the sample mean of ADG of 
the 7670 Angus cattle (Table 1) because the impact of 
the prior on the posterior could be ignored given the 
very large sample size. The median and mean agreed 
well with each other and the first and third quartiles 
were also very similar (Table 1). These are indications 
that the posterior distribution of the mean of ADG is 
symmetric. 

The purpose of this example was to show parallel 
computing using the MPI (Message Passing Interface) li- 
brary. The change in computing time for this example 
was, however, almost insignificant because sampling 
from a normal distribution is very quick. In addition, 
with parallel simulation, inter-process communication 
requires some extra time as overhead, which offset gains 
from parallel computing. 

MPI is a language-independent communication 
protocol used to program parallel computers that is 
extensively used for high-performance computing. 
More specifically, MPI is a library of routines for cre- 
ating parallel programs e.g., in C or Fortran 77, that 
allow users to create programs that can run on most 
parallel computer architectures. (Note that there 



is a language extension to Fortran90 called High Per- 
formance Fortran - HPF, which supports high- 
performance computing.) In the example code, the 
MPI library was used to handle inter-process commu- 
nications in the C program. With MPI, each task can 
have its own local memory during computation (but 
multiple tasks can reside in the same physical machine 
and/or an arbitrary number of machines). Typically, tasks 
exchange data by sending and receiving messages but 
data transfer usually requires cooperation among 



Table 1 Posterior summary statistics of average body 
weight daily gain based on a single-parameter normal 
model 



Sample set 


Min 


Q1 


Median 


Mean 


Q3 


Max 


1 


3.357 


3.388 


3.394 


3.394 


3400 


3.431 


2 


3.356 


3.388 


3.394 


3.394 


3400 


3429 


3 


3.352 


3.388 


3.394 


3.394 


3400 


343 


4 


3.357 


3.388 


3.394 


3.394 


3400 


3436 


5 


3.353 


3.388 


3.394 


3.394 


3400 


3432 


6 


3.355 


3.388 


3.394 


3.394 


3400 


343 


7 


3.355 


3.388 


3.394 


3.394 


3400 


3.431 


8 


3.354 


3.388 


3.394 


3.394 


3400 


3428 


9 


3.356 


3.388 


3.394 


3.394 


3400 


3.431 


10 


3.358 


3.388 


3.394 


3.394 


3400 


3433 


Pooled 


3.352 


3.388 


3.394 


3.394 


3.400 


3.436 



Min = minimum value; Q1 =first quartile; Q3 = third quartile; max = maximum 
value. 

The above results were obtained from parallel computing on ten CPU cores. 
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processors, that is, a "send" operation must have a 
matching "receive" operation. 

A few details about this program in the Appendix are 
described in the following. MPI_Comm_rand() is used 
to find out the ID of all participating processors and 
MPI_Comm_size() is used to get the number of partici- 
pating processors. A common pattern of interaction 
among parallel processors is to use MPI_Send() and 
MPI ReceiveO to allocate work among them. In the 
present example, however, this was done in a slightly dif- 
ferent manner. MPI_Bcast() is used to send common 
parameter values (e.g., number of simulation steps) to all 
participating processors. Then, after each processor has 
finished its work, the subroutine MPI_Reduction() is 
used to sum up the posterior values from all processors. 
Subroutine MPI_Reduction() collects data from all pro- 
cessors, reduces the data to a single value (e.g., by sum- 
mation), and then stores the results on the master 
process (and on all processes as well). There are several 
predefined operations that MPI_Reduction() can pro- 
vide. In addition to summation, it can also conduct 
multiplication, and find minimum or maximum values. 
Finally, the master processor computes the means and 
standard deviation (and other posterior statistics, when 
relevant) for the mean of the normal model. Note that, 
in this illustration, we used sequential functions to gen- 
erate random numbers (http://apps.nrbook.eom/c/index. 
html), with process ID used as the random number seed. 
Preferably, one can use a parallel random number gener- 
ator, such as the Scalable Parallel Random Number Gen- 
erators (SPRNG) Library (http://sprng.cs.fsu.edu/). 

Running multiple chains for Bayeslan LASSO modeling 

In this example, we show how to parallelize multiple 
chains for the Bayesian LASSO (Least Absolute Shrink- 
age and Selection Operator) regression model [12,27]. In 
the whole-genome prediction context, consider the fol- 
lowing linear model: 

y = 1 ^ + Xp+ e, (14) 

where y is a vector of phenotypes, ^ is an effect common 
to all observations, (i is a vector of unknown marker 
effects, X is an incidence matrix, and e is a vector of 
residuals. The prior specification follows de los Campos 
et al. [12] but without the terms for additive genetic 
effects. An R package, BLR, was used to implement the 
Bayesian LASSO model [28]. 

The dataset used here consisted of 147 Angus cattle, 
each genotyped for 37 892 polymorphic SNP (Single 
Nucleotide Polymorphisms) markers and with esti- 
mated breeding values (EBV) for marbling score as re- 
sponse variable. In addition to running a single long 
chain of 100 000 iterations (after a burn-in of 1000 



iterations), we also ran 10 chains, each consisting of 
10 000 iterations (after a burn-in of 1000 iterations). 
All jobs were submitted and run on a Condor cluster 
at the University of Wisconsin - Madison [29]. This 
cluster provides 1860 cores for distributed parallel 
computing. Among them, 1847 run a Linux operation 
system and the remaining a Windows operation sys- 
tem. Memory size ranges from 256M to 214G: 7.04% 
(<1G), 67.58% (1-3G), 22.69% (4-8G), and 2.67% (>10G). 
A Perl script was used, that installs the R system and 
required libraries (such as the SuppDists package) onto re- 
mote nodes prior to the computing and then executes the 
Bayesian LASSO program. This Perl script served as the 
executable in the Condor job batch file. 

For the multiple chains approach, each started with 
over-dispersed initial values (and with different seeds for 
the random number generator), and the chains con- 
verged after a certain number of iterations. Markov 
chain Monte Carlo convergence was examined using 
posterior samples of the residual variance collected from 
the first 4500 iterations of each chain. Trace plots of 
posterior samples of the residual variance showed that 
most chains tended to stabilize after 1000 iterations, and 
all approached 0.0034 (Figure 2a), which corresponds 
well to estimated residual variance in this example. The 
trace plot of the shrink factor V from the Gelman 
and Rubin method [17] suggested that a burn-in of 3000 
iterations would be more appropriate, because V 
approached 1.00 after the first 3000 iterations 
(Figure 2b). With \/ R^ —^ 1, within-sequence variance 
dominates between-sequence variance, and all sequences 
escape the influence of starting points and traverse all 
target distributions. The same convergence diagnosis 
can be done for all model parameters. 

The parallel computing took between 141 and 178 min 
to complete each of the 10 processes. The differences 
were due to varying CPU speeds and workloads on these 
computer nodes. In contrast, running a single chain with 
100 000 iterations (after a burn-in of 1000 iterations) on 
a Linux workstation with similar specifications took 1386 
min (Figure 3a). Thus, the reduction in runtime from 
parallel computing was approximately 7.78 fold. Poster- 
ior estimates of the model parameters were similar be- 
tween the two computing approaches (data not 
presented). 

When running multiple chains, the reduction in run- 
time is limited by the time required for burn-in. Let b 
denote the number of burn-in iterations that is required, 
and « be the number of iterations after the burn-in. 
Then, in a serial implementation, the chain will consist 
oi b + n iterations in total. Let there be K processes run- 
ning chains in parallel, each taking on an equal length of 
Markov chain (i.e., b + n/K iterations). Assuming each 
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Figure 2 Markov chain Monte Carlo convergence diagnosis, (a) Trace plots of posterior samples of residual variance obtained within 4500 
iterations from each of the 10 chains; (b) trace plot of shrink factor ^/R according to the Gelman-Rubin method y. 



iteration takes the same amount of processing time, the 
reduction in runtime is given by: 



SiK) 



n K- 



K 



oo 1 



(15) 



The above is an alternative form of (10) with b = p n 
and K treated as unknown. Ideally, if b = Q, this means 



"perfectly parallel computing", in which the potential re- 
duction in runtime is /C-fold. However, when ^ is a sig- 
nificant proportion of n, the actual speedup falls well 
short of its potential. 

Practically, burn-in time is related to the mixing rate 
of the chain, which is related to the total length of the 
Markov chain. Let n = \Qb, which is a useful rule-of- 
thumb in most practical situations. Then, equation (15) 
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depends only on parameter K. Hence, we have 

5(^^ = 8) = ITW8«5and S(/C = 16) = < 7. re- 

spectively, for K=8 and K=16. As oo, the 
speedup is upper-bounded at: 



S{K) 



b+lOb 



b+10 bx /C-i 



11. 



(16) 



Thus, when running multiple chains, each with a signifi- 
cant length of burn-in, the speedup does not scale well 
with the number of available CPU cores. Let n = t x b. 
Then, the speedup 5 is a function of t and K, as depicted 
in Figure 3b. Clearly, given the fixed relationship between 
n and b, the speedup will reach a plateau after the number 
of CPU cores reaches a certain level. 

Parallel BayesCpC for whole-genome prediction 

Finally, we show a real application of parallel computing 
on genome-enabled prediction in beef cattle. The com- 
puting was implemented with a high-throughput com- 
puting pipeline called parallel-BayesCpC [30]. This is a 



high-throughput computing package and a member of 
the WGSE (Whole-Genome-enabled Selection and 
Evaluation) family [31] of distributed high-throughput 
computing pipelines. In computing, a pipeline is a set of 
data processing elements connected in series (i.e., the 
output of one element is the input of the next one) and 
the elements of a pipeline can be executed in parallel or 
sequentially. Typically, pipelining increases the comput- 
ing throughput. In the context of whole-genome predic- 
tion, the pipeline that we have developed can automate 
all steps involved in the computing and decision making 
for whole-genome prediction, which includes and is not 
limited to data input and quality control, model feature 
selection (FS) if applicable, post-FS statistical inference 
and cross-validation (CV), and output and documenta- 
tions (Figure 4a). The parallel-BayesCpC package is so 
named because it uses the BayesCn model for FS and 
the BayesC { n = 0) model for post-FS statistical infer- 
ence and CV (Figure 4b). This package can reside on 
both Condor and OSG (Open Science Grid) and is pro- 
vided with a Condor SOAR web interface for automatic 
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scheduling of jobs and storage of output files (Figure 4c). 
In Condor, for example, job dependency can be 
conveniently handled as so-called "DAGMan jobs" 
(Appendix - b). Note that the user does not need to 
know how to write Condor job batch files as all these 
files will be automatically produced by scripts of 
BayesCpC based on the user's input. 

For simplicity of illustration, consider a linear model 
with only the overall mean and marker effects: 

p 

yi = n+ Z z ij a j+ e i (17) 

where y ; is the phenotype (or estimated genetic value) 
of the animal; ^ is the overall mean; Uj is the substitu- 
tion effect associated with the SNP ( / = 1, . . . , p); 
z iy is a variable corresponding to the genotype of the /'* 
SNP (0, 1, 2) for the individual, and e ~ Af(0, a I) 
is a residual term, where I is the residual variance. A 
priori, the BayesCn model assumes that the effect of an 
SNP is null with probability n, or that it follows a nor- 
mal distribution, N(Q, a ^), with probability 1- n. That 
is, 

« . l\ ^ ^(''' ^ probability {l-n) 

' " \ = 0 with probability n ^ ' 

Here, cr ^ is a variance common to all non-zero SNP 
effects, which is assigned a scaled inverse chi-square distri- 
bution, x~^^{'^ a)- Furthermore, the value of n is un- 
known and needs to be inferred, given the prior 
distribution of n that is taken as uniform between 0 and 1, 
7r~ Umform{0,l). A Bernoulli indicator variable, S j, is 
introduced to facilitate sampling from the mixtures of 
the SNP effects, p{5i\ n) = n^" ^'{1 - n) ^' . Hence, 
unconditionally, the variable aj follows a multivariate-t 
distribution, t(0, S^, Va), if 6j = 1, or equals zero 
otherwise [32]. Posterior inference of unknown parameters 
in the Bayesian model via Markov chain Monte Carlo im- 
plementation is described by Habier et al. [11]. 

With a subset of, say k< p markers selected in a cer- 
tain iteration of the MCMC for the BayesC n model, 
then the next iteration assumes that all the k selected 
SNP have non-null effects on the quantitative trait. The 
above defines BayesC model with n = 0, which takes 
the same form as (15) but with n = 0 and p replaced by 
k. Posterior inference in BayesC {n = 0) is the same as 
for BayesC n, except that n is fixed at zero and sampling 
indicator variables is no longer relevant. 

Typically, K-fo\d CV is often used to evaluate predict- 
ive models, in which the whole dataset is divided into K 
portions of approximately equal size. The model is then 
trained in the set of K-1 portions of the data and pre- 
dicted in the remaining one portion. Portioning of 



training and testing sets is then rotated K times in each 
CV experiment. Furthermore, each CV experiment can 
be randomly replicated a number of times in order to in- 
crease the stability of model evaluation (but we did not 
do that in the present study). 

As an example, we used the parallel BayesCpC pack- 
age to select the optimal number of SNP to predict rib 
eye area in a beef cattle population. The data consisted 
of 2919 animals each with estimated breeding values for 
rib eye area and genotypes obtained from the lUumina 
50K SNP Beadchip. After data editing and data quality 
control, 46 723 polymorphic SNP markers were used. 
The optimal panel search started with the top 50 SNP 
markers according to the posterior model probability for 
having a non-zero effect for this marker, and then 
increased from the top 100 to the top 3000 markers at 
an increment of 100 markers each time. We did not ex- 
haust all possible panels beyond 3K because the predic- 
tion accuracy showed a constant trend of decrease after 
the panel reached 1000 SNP. 

In the parallel-BayesCpC package, MPI is used for data 
quality control and parallel MCMC is employed by both 
FS and CV. In our example, distributed jobs were sub- 
mitted to a local Condor pool with 128 cores. These are 
Linux workstations, with Intel " Xeon " CPU (2.67 GHz), 
> 8G memory per slot and a cache size of 12 288 KB. 
The submit machine is also a Linux workstation with 
Intel® Xeon® CPU (3.00 GHz), 16G memory and 6144 
KB cache size. Each parallel MCMC chain for feature se- 
lection consisted of 10 000 iterations after a burn-in of 
2000 iterations, thinned every one-tenth. Each parallel 
MCMC for CV consisted of 25 000 iterations, after a 
burn-in of 5000 iterations, thinned every one-tenth. 

We examined three computing strategies to search for 
an optimal panel size for predicting genetic merit using 
SNP markers. The first strategy executed 30 meta-jobs 
in parallel, each consisting of one round of parallel FS 
jobs using all SNP markers and one round of parallel 
post-FS inference and CV for a specific panel (X = 50, 
100, 200, . . ., 3000, respectively). In the second strategy, 
one round of parallel FS jobs was executed, followed by 
30 rounds of parallel post-FS inference and CV jobs con- 
ducted sequentially for all panel sizes. Parallel CV jobs 
for the 500-SNP panel started only when parallel CV 
jobs for the 400-SNP jobs had finished. The third strat- 
egy consisted of 30 meta-jobs executed in series, with 
each meta-job consisting of one round of parallel FS jobs 
using all SNP markers and one round of parallel post-FS 
inference and CV for a specific panel. The difference be- 
tween the second and the third strategy is that different 
optimal panel sizes were selected based on the same set 
of FS results with the second computing strategy but 
each panel was selected based on a different set of FS 
results with the third computing strategy. 
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Figure 4 Whole genome-enabled selection and evaluation (WGSE) pipelines and parallel BayesCpC. (a) Graphic illustration of the WGSE 
pipelining; (b) worl<flow; (c) Condor SOAR webpage of the parallel BayesCpC pipeline. 



There were very significant differences in computing 
time between the three computing strategies (Figure 5a). 
The first strategy consumed the least time (mostly com- 
pleted within 12 h), because it used more features of 
parallel computing, but it also required far more slots 
for computing (in total 90 slots were needed). The sec- 
ond strategy avoided the use of many slots because it 
ran only one round of parallel FS jobs and then executed 
30 rounds of parallel post-FS inference and CV jobs 



sequentially based on the same set of FS results. We ran 
three parallel chains for FS and also used 3-fold CV to 
evaluate prediction accuracy and, hence, only three slots 
were needed for this strategy. Because CV jobs on a sub- 
set of markers typically ran much faster than a FS job on 
all markers, it was computationally efficient to run these 
CV jobs sequentially. The computing time necessary for 
the second strategy was approximately two times greater 
than for the first strategy. However, the third strategy 
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Figure 5 Computing time by three parallel computing strategies (a) and prediction correlations by the first strategy (b). Three parallel 
computing strategies were used in search of optimal SNP panel sizes for predicting genetic merit. The first strategy executed 30 meta-jobs in 
parallel, each consisting of one round of parallel feature selection (FS) jobs using all SNP markers and one round of parallel post-FS inference and 
CV for a specific panel (X = 50, 100, 200, . . ., 3000, respectively); In the second strategy, one round of parallel FS jobs was executed, followed by 
30 rounds of parallel post-FS inference and CV jobs conducted sequentially for all panel sizes (P1_A); The third strategy consisted of 30 meta-jobs 
executed in series, with each meta-job consisting of one round of parallel FS jobs using all SNP markers and one round of parallel post-FS 
inference and CV for a specific panel (Pn_A). 



consumed the greatest amount of time (over two weeks). 
With this strategy, jobs for different panel sizes were 
executed in series but FS jobs and post-FS inference and 
CV jobs for each panel size were executed in parallel. If 
all these jobs were executed sequentially, the computing 
time necessary would exceed one month and half, and 
this is definitely not optimal. Comparatively, the first 
computing strategy was twice as fast as the second strat- 
egy and 29 times faster than the third strategy. 

Despite the differences in computing times, predic- 
tions obtained with the three computing scenarios were 
highly comparable. The correlation between estimated 
breeding values for rib eye area and their fitted values in 
the training set (referred to as fitting accuracy hereafter) 
increased almost monotonically with panel size, until it 
plateaued with a panel size of around 2000 SNP 
(Figure 5b). However, the correlation between the esti- 
mated breeding values and their predicted values in the 
testing set (referred to as predictive accuracy hereafter) 



reached its peak (0.8886) with a panel size of 1000 SNP, 
and then went down slightly. The highest predictive 
accuracy was observed with 500 to 1500 selected markers. 
The decrease in predictive accuracy with > 1000 SNP pos- 
sibly reflected over-fitting, which, in this case, occurred 
much before the panel size exceeded the training popula- 
tion size (i.e., around 2000 animals). Hence, with Bayesian 
regression models, prediction using more SNP may not 
necessarily give better results than prediction using a 
smaller panel. A model that describes the training set well 
does not necessarily yield the best predictions when gen- 
eralized to the population. This is referred to as poor 
generalization in machine learning [33]. The fitting and 
prediction accuracies are illustrated in Figure 6 for vari- 
ous panel sizes. 

In the Bayes CpC procedure, the BayesC n model postu- 
lates that a portion n of all SNP have zero effect. In a 
high-density SNP panel, n is typically expected to be large, 
meaning that the portion of "signal" SNP, 1- tt, is small 
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Figure 6 Prediction accuracies obtained from different panel sizes, (a) 100 SNP; (b) 1000 SNP; (c) 2000 SNP; (d) 3000 SNP. 



(Figure 7), so the chance of over-fitting is diminished. 
Using the Illumina BovineSOK SNP genotypes, the poster- 
ior mean (standard deviation) of 1 — n was 0.0148 
(0.0027). In prediction, the best predictions were obtained 
with 500 to 1500 selected SNP, supporting an optimal pre- 
dictive ability with 1.07% to 3.21% selected markers. Inter- 
estingly, this optimal range covered the posterior mean of 
1 — n, which is the portion of markers estimated to have 



non-zero effects on the trait. This result differs somewhat 
from what we obtained using 3K SNP panels (data no 
published yet). For the 3K genotypes, the estimated num- 
ber of SNP having non-zero effects based on BayesC n in 
the training set did not correspond to the number of SNP 
in the optimal SNP panel for prediction. Hence, we postu- 
late that parameter tt in a BayesC n model may not pro- 
vide information on the size of an optimal SNP panel for 
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prediction for small panels but could be informative for a 
higher density of markers. 

With the Bayesian regression models explored here, 
feature selection may be important since a model with 
all SNP does not necessarily give the best predictions. 
This situation is unlike ridge regression best linear un- 
biased prediction [34] or prediction using the G-BLUP 
method [35], for which a model with all markers would 
typically be favored. While selecting models of varying 
dimensions may be an issue to explore, it brings tremen- 
dous challenges to computing, particularly when the 
dataset is large. In this regard, high-performance com- 
puting offers a markedly competitive edge, not only in 



reducing computing time but also in tuning optimal 
models for whole-genome prediction. 

Discussion 

To date, almost all statistical software packages for animal 
breeding and genetics have been developed for serial com- 
putation. In such programs, only one instruction is exe- 
cuted at a time and after that instruction is finished, the 
next instruction begins. Hence, serial computing perform- 
ance depends heavily on the speed (clock frequency) of 
CPU, and the runtime of a program is approximately 
equal to the number of instructions multiplied by the 
average time per instruction. Keeping everything else con- 
stant, higher clock frequency leads to faster computing 
speed and thus decreased runtime for all computation- 
bounded programs [36]. This was the situation with the 
performance enhancement of microprocessors based on a 
single CPU from the 1980s to the early 2000s. However, 
rate of improvement has slowed down since 2003 due to 
hardware limitations incurred by energy consumption and 
heat dissipation. On the other hand, parallel computing 
has gained impetus as a result of increasingly available 
multiple-core computers, computer clusters, and network- 
ing and has been referred to as the concurrency revolution 
[37]. In theory, multiple threads of execution can cooper- 
ate to complete the work faster than a serial setting. 

Parallel computing uses multiple processing elements 
concurrently to solve a problem. To implement parallel 
computing, one first needs to break the problem into 
discrete "chunks" of work, so that they can be distributed 
to run on multiple processors. This is known as task de- 
composition or partitioning. The next fundamental step 
in designing the parallel algorithm involves identifying 
independencies that are assumed explicitly in the model. 
Without loss of generality, let P, and Pj be two program 
jobs, where i < j indexes the order of execution. Bern- 
stein's conditions [38] can be used to identify whether or 
not the two jobs are independent and can be executed in 
parallel [39]. Let /, and O, be the input variables and out- 
put variables, respectively, of P,. Likewise, the same defi- 
nitions hold for Ij and O, of Pj . Then, P, and Pj are 
independent if they satisfy (1)/; n O; = 0, (2)/; n Oy = 
0, and (3) 0^0 O; = 0. Violation of condition (1) 
introduces a flow dependency because the results from 
the first job (P,) are used by the second job {Pj). Violation 
of condition (2) introduces an anti-dependency, because 
Pj would overwrite a variable needed by P, . The third 
condition represents an output dependency: when two 
statements write to the same location, the final result is 
determined by the last executed statement. 

MCMC algorithms have revolutionized the application 
of Bayesian inference, because it tackles a large range of 
complex inferential problems that were previously not 
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considered possible, tractable. In the meantime, statisti- 
cians are becoming ever more ambitious in the range 
(complexity) of models they consider and MCMC algo- 
rithms for large complex models often require enormous 
amounts of computing power. Consequently, effective ex- 
ploitation of parallel algorithms is of high relevance to 
Bayesian computation. A difficulty, however, is that 
MCMC algorithms are serial by nature and do not easily 
migrate onto a parallel system. Nevertheless, various strat- 
egies can be used to design parallel MCMC methods. The 
key is to identify steps with data independence or condi- 
tional independence on which parallelism can reside. Sev- 
eral algorithms or strategies exist for running parallel 
MCMC, which is straightforward and it requires minimal 
inter-process communication. However, poorly mixing 
MCMC algorithms with long burn-in periods are not 
ideally suited to this situation, because a long period of 
burn-in must be repeated on every available CPU core. 
Thus, it is often desirable to explore strategies for parallel- 
ism within single chains. 

The actual performance of parallel MCMC depends 
on several issues. Among them, inter-process communi- 
cation is a primary factor. Many parallel applications re- 
quire processes to share data with each other, which is 
known as inter-task communication and implies over- 
head. Inter-task communication can offset the gain in 
computing speed from parallel computing, because it 
frequently requires some type of synchronization be- 
tween tasks, causing processes to spend time "waiting". 
In the worst case, competing communication traffic can 
saturate the available network bandwidth, leading to 
poor parallel computing performance. Thus, there is al- 
ways a need to balance the distribution of work among 
tasks. There is no simple rule for this type of load balan- 
cing. Ideally, all tasks are to be kept busy all of the time, 
so that task idle time is minimized [9,10]. 

The ratio of computation to communication is qua- 
litatively measured by the concept of granularity 
(https:// computing.llnl.gov/ tutorials/ parallel_comp/). On 
one hand, a low computation to communication ratio 
(fine-grain parallelism) facilitates load balancing, as rela- 
tively small amounts of computational work are done be- 
tween communication events but this can imply high 
communication overhead and less opportunity for per- 
formance enhancement, because communication and 
synchronization between tasks may take longer than the 
computation. On the other hand, a high computation to 
communication ratio (coarse-grain parallelism) allows 
more opportunity for performance enhancement; because 
relatively large amounts of computational work are done 
between communication and synchronization events. 
However, it is difficult to balance loads efficiently with 
coarse-grain parallelism and computing time may differ 
dramatically between computer cores. Therefore, there is 



a tradeoff between computing and communication and 
the optimal granularity depends on the problem at hand. 
In most parallel MCMC problems, it is advantageous to 
have coarse granularity because the overhead associated 
with communication and synchronization is high relative 
to execution speed, but fine-grain parallelism can some- 
times help reduce overhead due to load imbalance. 

Conclusions 

In this paper, we have shown the principles and exam- 
ples of parallel MCMC, with applications to whole- 
genome prediction of breeding values. Parallel comput- 
ing operates on the principle that a large problem can 
be split into smaller components and solved concur- 
rently (i.e., "in parallel"), each on a separate processor 
(or CPU core). In the context of parallel MCMC, two 
basic algorithms exist: running multiple chains and 
parallelism within a single chain, yet some variants can 
be useful as well. In principle, all Bayesian models can 
be parallelized in computing but the associated algo- 
rithms and strategies may differ, leading to varied 
computing efficiencies. Although many technical details 
have yet to be explored, we expect that the use of par- 
allel MCMC methods will revolutionize computational 
tools for research and breeding programs for animals 
in the post-genome era. 

Appendix 

(a) Example C code using MPI for parallel simulation of 
a single-parameter normal model with unknown mean 
and known variance 

/* This is an example C program using MPI for paral- 
lel computing of * 

* a single-parameter normal model with unkown mean 
and known * 

* variance. For illustration purpose, the step for data 
input is omitted. * 

* Instead, the sample mean and standard deviation, as 
well as the prior ' 

* mean and standard are used. * 

* Contact: X-L Wu, nick.wu@ansci.wisc.edu; UW- 
Madison, 09-12-2011 */ 

#include < stdio.h> 

#include < math.h> 

#include < mpi.h> 

#include "ranNum.h" 

main(int argc, char **argv) 

{int proc_id, root_process, nprocs, ierr, niters, i; 

double xi, xi2, sum, psum, sum_xi2, psum_xi2; 

double tarO, tarl, sdn, tarn, varn, mun, mumu, sdmu; 

MPI_Status status; 

/* Prior mean and standard deviation */ 
double muO = 4.000; 
double sdO = 1.000; 
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/* sample size, mean and variance */ 

int nind = 7670; 

double mul = 3.394; 

double sdl = 0.580; 

/* compute posterior statistics */ 

tarO = 1.0/(sdO*sdO); 

tarl = (1.0*nind)/(sdl*sdl); 

varn = 1.0/(tarO + tarl); 

sdn = sqrt(varn); 

mun = varn * (tarO*muO + tarl*mul); 
/* Parallel simulation begins, starting from here */ 
/* process 0 as the root process. */ 
rootprocess = 0; 

/* Replicate this process to create parallel processes. */ 
ierr = MPI_Init(&argc, &argv); 
/* Allocate memory for random seed variable*/ 
long* idum; 

MPI_Alloc_mem(sizeof(long),MPI_INFO_NULL, 
&idum); 

/* Find out process ID and number of participating 
processes. */ 

ierr = MPI_Comm_rank(MPI_COMM_ WORLD, 
&proc_id); 

ierr = MPI_Comm_size(MPI_COMM_WORLD, 
Scnprocs); 

/* Root process gets the number of simulation steps */ 
if(proc_id == root_process) { 

printf("Please enter the number of simulation steps: "); 
scanf("%i", &niters);} 

/* Broadcast the number of simulation steps to all par- 
ticipating processes */ 

ierr = MPI_Bcast(&niters, 1, MPI INT, root process, 
MPI_COMM_WORLD); 

/'process id as the random number seed */ 

*idum = proc_id; 

/* Each process computes a partial sum of simulated 
values */ 
psum = 0; 
psum_xi2 = 0; 

for(i = proc_id + 1; i < niters + 1; i + = nprocs) { 
xi = mun + sdn * randomnormal(idum); 
xi2 = xi * xi; 
psum = psum + xi; 
psum_xi2 = psum_xi2 + xi2;} 

printf("proc %i computes: %An", proc_id, (float)psum); 

/* Do a reduction in which all partial sums are com- 
bined into the grand sum */ 

ierr = MPI_Reduce(&psum, &sum, 1, MPI_DOUBLE, 

MPI_SUM, root_process, MPI_COMM_WORLD); 

ierr = MPI_Reduce(&psum_xi2, &sum_xi2, 
l,MPI_DOUBLE, 

MPI_SUM, root_process, MPI_COMM_WORLD); 

/* Finally, the root process prints posterior mean and 
standard error of mu. */ 



if(proc_id == root_process) { 
mumu = sum / niters; 

sdmu = sqrt((sum_xi2 - niters*mumu*mumu)/(niters-l)); 
printf("The posterior mean and variance of mu is %f 
and %{\n", (float)mumu,(float)sdmu);} 
/* Close down this processes. */ 
ierr = MPI_Finalize();} 

(b) Condor job batch files (Note: these Condor scripts 
were written automatically be a R scripts based on a input 
parameter file. Using the BayesCpC package, a user does 
not need to write this type of Condor job batch files.) 

# Condor DAGMan job 
JOB input job lnputData 
JOB selection job Selection 

SCRIPT POST selection /usr/local/wgse_beta/V2/ 
cmdpostscriptSelectionSummary 
JOB validation job Validation 

SCRIPT POST validation /usr/local/wgse_beta/V2/ 
cmdpostscriptOutputResults 
PARENT input CHILD selection 
PARENT selection CHILD validation 

# Data input and quality control 
Universe = vanilla 

Executable = /usr/local/wgse_beta/V2/rungeneric.pl 

Arguments = —new —type = R — tarball = built-sl5-R- 
2.10.1.tar.gz — installfrom = R-2.10.1 --cmdtorun = plda- 
ta_input_n_processing.R —unique = input 

Log = step_input.log 

Output = step_input.out 

Error = step_input.error 

notification = NEVER 

should_transfer_files = YES 

when_to_transfer_output = ONEXIT 

transfer_input_files =(data path and files are omitted), 
/home/nickwu/BayesCpC/ V2_UW/data/datafile.csv, 
/ usr/local/wgse_beta/V2/rungeneric.pl, /usr/local/wgse_- 
beta/V2/SLIBS.tar.gz, /usr/local/wgse_beta/V2/cmd_da- 
ta_input_n_processing, /usr/local/wgse_beta/V2/ 
pl_data_input_n_processing.R, 

Queue 

# Feature selection 
Universe = vanilla 

Executable = /usr/local/wgse_beta/V2/rungeneric.pl 

Arguments = —new --type = R —tarball = built-sl5-R- 
2.10.Ltar.gz —installfrom = R-2.10.1 --cmdtorun = 
p4_BayesCpC_validation.R —unique = validationl 1 

Log = step_validation.${process).log 

Output = step_validation.$ (process). out 

Error = step_validation.$(process).error 

notification = NEVER 

should_transfer_files = YES 

when_to_transfer_output = ONEXIT 

transfer_input_files = .linkPar, 
/usr/local/wgse_beta/V2/rungeneric.pl, / usr/local/wgse_ 
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beta/V2/SLIBS.tar.gz, /usr/local/wgse_beta/V2/ 
cmd_cross_validation, / usr/local/wgse_beta/V2/ 
p4_BayesCpC_validation.R, 
Queue 

Arguments = —new —type = R — tarball = built-sl5-R- 
2.10.1.tar.gz — installfrom = R-2.10.1 — cmdtorun = 
p4_BayesCpC_validation.R --unique = validation_2 2 

Queue 

Arguments = —new —type = R —tarball = built-sl5-R- 
2.10.1.tar.gz —installfrom = R-2.10.1 —cmdtorun = 
p4_BayesCpC_validation.R --unique = validation s 3 

Queue 

# Post-FS inference and cross-validation 
Universe = vanilla 

Executable = /usr/local/wgse_beta/V2/rungeneric.pl 

Arguments = —new —type = R --tarball = built-sl5-R- 
2.10.1.tar.gz —installfrom = R-2.10.1 —cmdtorun = 
p4_BayesCpC_validation.R --unique = validation_l 1 

Log = step_validation.$(process).log 

Output = step_validation.$(process).out 

Error = step_validation.$(process) .error 

notification = NEVER 

should_transfer_files = YES 

when_to_transfer_output = ON_EXIT 

transfer_input_files = .linkPar, /usr/local/wgse_beta/V2/ 
rungeneric.pl, /usr/local/wgse_beta/V2/SLIBS.tar.gz, 
/usr/local/ wgse_beta/ V2/ cmd_cross_validation, /usr/ 
local/wgse_beta/V2/p4_BayesCpC_validation.R, 

Queue 

Arguments = —new —type = R —tarball = built-sl5-R- 
2.10.1.tar.gz —installfrom = R-2.10.1 —cmdtorun = 
p4_BayesCpC_validation.R --unique = validation_2 2 

Queue 

Arguments = —new —type = R —tarball = built-sl5-R- 
2.10.1.tar.gz —installfrom = R-2.10.1 —cmdtorun = 
p4_BayesCpC_validation.R --unique = validation_3 3 

Queue 
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